Execute: Cmd+Shift+Enter. New Chunk: Cmd+Option+I. Preview: Cmd+Shift+K Notes: The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
library(leaps)
library(modelr)
library(mgcv)
library(ggplot2)
library(DataExplorer)
library(dplyr)
library(GGally)
library(ggExtra)
library(caret)
library(reshape2)
library(RColorBrewer)
library(plotly)
library(PerformanceAnalytics)
library(corrplot)corrplot 0.84 loaded
Due to the large number of libraries in use I have provided system information.
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /anaconda3/lib/R/lib/libRblas.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.8.0 RColorBrewer_1.1-2 reshape2_1.4.3 caret_6.0-81 lattice_0.20-38 ggExtra_0.8 GGally_1.4.0 dplyr_0.8.0.1
[9] DataExplorer_0.7.0 ggplot2_3.1.0 mgcv_1.8-26 nlme_3.1-137 modelr_0.1.3 leaps_3.0
loaded via a namespace (and not attached):
[1] httr_1.4.0 tidyr_0.8.2 jsonlite_1.6 viridisLite_0.3.0 splines_3.5.1 foreach_1.4.4 prodlim_2018.04.18 shiny_1.2.0
[9] assertthat_0.2.0 stats4_3.5.1 yaml_2.2.0 ipred_0.9-8 pillar_1.3.1 backports_1.1.3 glue_1.3.0 digest_0.6.18
[17] promises_1.0.1 colorspace_1.4-0 recipes_0.1.4 htmltools_0.3.6 httpuv_1.4.5.1 Matrix_1.2-15 plyr_1.8.4 timeDate_3043.102
[25] pkgconfig_2.0.2 broom_0.5.1 purrr_0.2.5 xtable_1.8-3 scales_1.0.0 later_0.8.0 gower_0.1.2 lava_1.6.5
[33] tibble_2.0.1 generics_0.0.2 withr_2.1.2 nnet_7.3-12 lazyeval_0.2.1 survival_2.43-3 magrittr_1.5 crayon_1.3.4
[41] mime_0.6 evaluate_0.12 MASS_7.3-51.1 class_7.3-15 rsconnect_0.8.13 tools_3.5.1 data.table_1.12.0 stringr_1.4.0
[49] munsell_0.5.0 networkD3_0.4 compiler_3.5.1 rlang_0.3.1 grid_3.5.1 iterators_1.0.10 rstudioapi_0.9.0 htmlwidgets_1.3
[57] crosstalk_1.0.0 igraph_1.2.4 miniUI_0.1.1.1 labeling_0.3 base64enc_0.1-3 rmarkdown_1.11 gtable_0.2.0 ModelMetrics_1.1.0
[65] codetools_0.2-16 reshape_0.8.8 R6_2.4.0 gridExtra_2.3 lubridate_1.7.4 knitr_1.21 stringi_1.3.1 parallel_3.5.1
[73] Rcpp_1.0.0 rpart_4.1-13 tidyselect_0.2.5 xfun_0.4
repr IRdisplay IRkernel
"0.19.2" "0.7.0" "0.8.15"
First we inspect the raw records using bash:
"","Sales","CompPrice","Income","Advertising","Population","Price","ShelveLoc","Age","Education","Urban","US"
"1",9.5,138,73,11,276,120,"Bad",42,17,"Yes","Yes"
"2",11.22,111,48,16,260,83,"Good",65,10,"Yes","Yes"
"3",10.06,113,35,10,269,80,"Medium",59,12,"Yes","Yes"
"4",7.4,117,100,4,466,97,"Medium",55,14,"Yes","Yes"
Now I load the data into r, and drop the “ID” column.
carseats <- read.csv("data/Carseats_org.csv", header = T, stringsAsFactors = T)
drops <- c("X")
carseats <- carseats[ , !(names(carseats) %in% drops)]
head(carseats, 10)Here I create two new dataframes to manage numeric and categorical data.
# get vectors of continuous and categorical cols
nums <- dplyr::select_if(carseats, is.numeric)
cats <- dplyr::select_if(carseats, is.factor)
nums[sample(nrow(nums), 10), ]Let’s get quick summaries of each:
Sales CompPrice Income Advertising Population Price Age Education
Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000 Min. : 10.0 Min. : 24.0 Min. :25.00 Min. :10.0
1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000 1st Qu.:139.0 1st Qu.:100.0 1st Qu.:39.75 1st Qu.:12.0
Median : 7.490 Median :125 Median : 69.00 Median : 5.000 Median :272.0 Median :117.0 Median :54.50 Median :14.0
Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635 Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
Max. :16.270 Max. :175 Max. :120.00 Max. :29.000 Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
ShelveLoc Urban US
Bad : 96 No :118 No :142
Good : 85 Yes:282 Yes:258
Medium:219
'data.frame': 400 obs. of 8 variables:
$ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
$ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
$ Income : int 73 48 35 100 64 113 105 81 110 113 ...
$ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
$ Population : int 276 260 269 466 340 501 45 425 108 131 ...
$ Price : int 120 83 80 97 128 72 108 120 124 124 ...
$ Age : int 42 65 59 55 38 78 71 67 76 76 ...
$ Education : int 17 10 12 14 13 16 15 10 10 17 ...
'data.frame': 400 obs. of 3 variables:
$ ShelveLoc: Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
$ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
$ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
This command is to inspect the different data types in the data.
'data.frame': 400 obs. of 11 variables:
$ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
$ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
$ Income : int 73 48 35 100 64 113 105 81 110 113 ...
$ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
$ Population : int 276 260 269 466 340 501 45 425 108 131 ...
$ Price : int 120 83 80 97 128 72 108 120 124 124 ...
$ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
$ Age : int 42 65 59 55 38 78 71 67 76 76 ...
$ Education : int 17 10 12 14 13 16 15 10 10 17 ...
$ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
$ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
[1] ""
[1] "Number of Columns: 11"
[1] "Number of Numeric Columns: 8"
[1] "Number of Categorical Columns: 3"
Here’s a quick way to examine general properties of the data:
I start out with a few general scatter plots.
plot_ly(data=carseats,
x=~Age,
y=~Sales,
mode = 'markers',
type = 'scatter',
color=~ShelveLoc) %>%
layout(title = "Age, Shelf Location, and Sales Scatter Plot", width=900) Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
This plot below shows good separation and a weak linear trend. These variables are worth investigating.
plot_ly(data=carseats,
x=~Price,
y=~Sales,
mode = 'markers',
type = 'scatter',
color=~ShelveLoc) %>%
layout(title = "Price, Shelf Location, and Sales Scatter Plot", width=900)Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
Here we inspect the density of the ‘Price vs Sales’ relationship:
plot_ly(data=carseats,
x=~Price,
y=~Sales,
mode = 'markers',
size = ~Price,
type = 'scatter',
colors = "Dark2",
alpha = .6) %>%
layout(title = "Price, US, and Sales Scatter Plot", width=900) Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.
I choose to normalize the numeric data in order to be able to plot each variable on the same scale. This will allow me to investigate the variation of each predictor relaticve to Sales.
preObj <- preProcess(nums, method=c("center", "scale"))
scaled.nums <- predict(preObj, nums)
head(scaled.nums)'data.frame': 400 obs. of 8 variables:
$ Sales : num 0.7095 1.3185 0.9078 -0.0341 -1.1849 ...
$ CompPrice : num 0.849 -0.911 -0.781 -0.52 1.045 ...
$ Income : num 0.155 -0.738 -1.203 1.12 -0.166 ...
$ Advertising: num 0.656 1.408 0.506 -0.396 -0.547 ...
$ Population : num 0.0757 -0.0328 0.0282 1.3649 0.51 ...
$ Price : num 0.178 -1.385 -1.512 -0.794 0.515 ...
$ Age : num -0.699 0.721 0.35 0.104 -0.946 ...
$ Education : num 1.183 -1.4882 -0.725 0.0382 -0.3434 ...
[1] ""
Sales CompPrice Income Advertising Population Price Age Education
Min. :-2.65440 Min. :-3.12856 Min. :-1.70290 Min. :-0.9977 Min. :-1.72918 Min. :-3.87702 Min. :-1.74827 Min. :-1.48825
1st Qu.:-0.74584 1st Qu.:-0.65049 1st Qu.:-0.92573 1st Qu.:-0.9977 1st Qu.:-0.85387 1st Qu.:-0.66711 1st Qu.:-0.83779 1st Qu.:-0.72504
Median :-0.00224 Median : 0.00163 Median : 0.01224 Median :-0.2459 Median : 0.04858 Median : 0.05089 Median : 0.07268 Median : 0.03816
Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.64575 3rd Qu.: 0.65375 3rd Qu.: 0.79834 3rd Qu.: 0.8067 3rd Qu.: 0.90693 3rd Qu.: 0.64219 3rd Qu.: 0.78255 3rd Qu.: 0.80137
Max. : 3.10670 Max. : 3.26225 Max. : 1.83458 Max. : 3.3630 Max. : 1.65671 Max. : 3.17633 Max. : 1.64673 Max. : 1.56457
Here are scaled distributions (histograms and density plots) for each numeric variable, including Sales. The variables relating to money ($) tend to be approximately normal. Many other variables tend to be approximately uniform, which does not bode well for their predictive power.
scaled.nums %>%
tidyr::gather() %>%
ggplot(aes(x=value,y=..density..))+
ggtitle('Distributions of Continous Variables (scaled)') +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill=I("orange"), col=I("black")) +
facet_wrap(~ key, scales = "free") +
geom_density(color="blue", fill='light blue', alpha = 0.4)Here we plot all numeric variables against their distributions. This is just another way to examine the information shown above.
scaled.nums %>%
tidyr::gather() %>%
plot_ly(x=~key, y=~value,
type = "box",
boxpoints = "all",
jitter = 0.4,
pointpos = 0,
color = ~key,
colors = "Dark2") %>%
subplot(shareX = TRUE) %>%
layout(title = "Numeric Variable Distributions (scaled)",
yaxis=list(title='Standard Deviation'),
xaxis=list(title='Variable'),
autosize=FALSE,
width=900,
height=900)Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
Here we plot all numeric variables against Sales (scaled). This allows us to investigate possible linear relationships between that variable and Sales. As shown below, only ‘Price’ appears to have a linear relationshop worth investigating.
scaled.nums %>%
tidyr::gather(-Sales, key = "var", value = "value") %>%
split(.$var) %>%
lapply(function(d) plot_ly(d, x=~value, y=~Sales,
mode = 'markers',
type = 'scatter',
colors = 'Set3',
size = ~Sales^2,
marker = list(line = list(
color = 'black',
width = 2)),
name=~var)) %>%
subplot(nrows=7, shareY=TRUE) %>%
layout(title = "Scatterplot Numeric vs Sales (scaled)<br>Size=Sales^2",
autosize=FALSE,
width=900,
height=1500,
yaxis=list(title='Sales'))`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.`line.width` does not currently support multiple values.Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
By adding naive regression lines to a few scatterplots we can confirm our suspicions:
fit.Pop <- lm(Sales ~ Population, data = scaled.nums)
fit.Age <- lm(Sales ~ Age, data = scaled.nums)
fit.CompPrice <- lm(Sales ~ CompPrice, data = scaled.nums)
fit.Price <- lm(Sales ~ Price, data = scaled.nums)
p1 <- plot_ly(scaled.nums,
x = ~Population,
name = 'Population vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'Population vs Sales Observations') %>%
add_lines(x = ~Population,
y = fitted(fit.Pop))
p2 <- plot_ly(scaled.nums,
x = ~Age,
name = 'Age vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'Age vs Sales Observations') %>%
add_lines(x = ~Age,
y = fitted(fit.Age))
p3 <- plot_ly(scaled.nums,
x = ~CompPrice,
name = 'CompPrice vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'CompPrice vs Sales Observations') %>%
add_lines(x = ~CompPrice,
y = fitted(fit.CompPrice))
p4 <- plot_ly(scaled.nums,
x = ~Price,
name = 'Price vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'Price vs Sales Observations') %>%
add_lines(x = ~Price,
y = fitted(fit.Price))
subplot(p1, p2, p3, p4, nrows=2, shareY=TRUE) %>%
layout(title = "Scatterplot Numeric vs Sales (scaled) <br> With Regression Lines",
autosize=FALSE,
width=1000,
height=700,
yaxis=list(title='Sales'))Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
Let’s compare the slopes:
scaled.nums %>%
plot_ly(y = ~Sales) %>%
add_lines(x= ~Population, y = fitted(fit.Pop),
name = "fit.Pop slope", line = list(shape = "linear")) %>%
add_lines(x= ~Age, y = fitted(fit.Age),
name = "fit.Age slope", line = list(shape = "linear")) %>%
add_lines(x= ~CompPrice, y = fitted(fit.CompPrice),
name = "fit.CompPrice slope", line = list(shape = "linear")) %>%
add_lines(x= ~Price, y = fitted(fit.Price),
name = "fit.Price slope", line = list(shape = "linear")) %>%
layout(title = "Regression Lines vs Sales (scaled)",
autosize=FALSE,
width=700,
height=500,
yaxis=list(title='Sales'),
xaxis=list(title='Scaled Numeric Variable'))Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
Here’s a pretty graphic that doesn’t help me understand anything about the data.
y = scaled.nums$Sales
x = scaled.nums$Price
s <- subplot(
plot_ly(x = x, color = I("black"), type = 'histogram'),
plotly_empty(),
plot_ly(x = x, y = y, type = 'histogram2dcontour', showscale = F),
plot_ly(y = y, color = I("black"), type = 'histogram'),
nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2),
shareX = TRUE, shareY = TRUE, titleX = FALSE, titleY = FALSE)No trace type specified and no positional attributes specifiedNo trace type specified:
Based on info supplied, a 'scatter' trace seems appropriate.
Read more about this trace type -> https://plot.ly/r/reference/#scatter
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
layout(s, showlegend = FALSE, autosize=FALSE,
width=700,
height=500,
yaxis=list(title='Sales'),
xaxis=list(title='Price'))Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
First, let’s create a dataframe that we can use:
Let’s try ’em all at once. This works, but I don’t manage the colors very well.
colr = brewer.pal(9, "Set1")
categorical.by.sales %>%
tidyr::gather(-Sales, key = "var", value = "value") %>%
split(.$var) %>%
lapply(function(d) plot_ly(d, x=~paste(var, '<br>' , value), y=~Sales,
type = "box",
boxpoints = "all",
jitter = .2,
pointpos = 0,
color =~paste(var, ":", value),
colors = sample(colr, length(unique(d$value))))) %>%
# colors = "Set1")) %>%
subplot(shareY = TRUE, nrows=3) %>%
layout(title = "Categorical Variable Distributions vs Sales (scaled)",
yaxis=list(title='Sales Standard Deviation'),
xaxis=list(title=''),
autosize=FALSE,
width=900,
height=1500)attributes are not identical across measure variables;
they will be droppedSpecifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
I run these each separately and get much nicer plots.
categorical.by.sales %>%
plot_ly(x = ~US, y = ~Sales,
split = ~US,
type = 'violin',
box = list(visible = TRUE),
meanline = list(visible = TRUE)) %>%
layout(xaxis = list(title = "US"),
yaxis = list(title = "Sales",zeroline = FALSE))categorical.by.sales %>%
plot_ly(x = ~Urban, y = ~Sales,
split = ~Urban,
type = 'violin',
box = list(visible = TRUE),
meanline = list(visible = TRUE)) %>%
layout(xaxis = list(title = "Urban"),
yaxis = list(title = "Sales",zeroline = FALSE))Now this is interesting. We suspected that ‘ShelveLoc’ would be important based on one of the early scatterplots. It seems that this is the case.
First, let’s merge the data set into a single dataframe
First, let’s look at some things that may give us trouble. Luckily it looks like the only serious correlation is with our dependent variable.
res <- cor(scaled.nums)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)It appears that residuals are roughly symmetrical around 0. That’s strange. Mostly due to a relatively poor overall fit. Note how close to zero most of the coefficient estimates are.
Call:
lm(formula = Sales ~ ., data = scaled.merged)
Residuals:
Min 1Q Median 3Q Max
-1.32467 -0.29881 -0.01677 0.27320 1.82851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.70000 0.06018 -11.633 < 2e-16 ***
CompPrice 0.44976 0.02515 17.883 < 2e-16 ***
Income 0.11240 0.02029 5.539 4.78e-08 ***
Advertising 0.24069 0.02389 10.076 < 2e-16 ***
Population 0.01656 0.02134 0.776 0.438
Price -0.70235 0.02530 -27.759 < 2e-16 ***
Age -0.23287 0.02012 -11.576 < 2e-16 ***
Education -0.02619 0.02029 -1.291 0.197
ShelveLocGood 1.39901 0.06043 23.152 < 2e-16 ***
ShelveLocMedium 0.59700 0.04974 12.003 < 2e-16 ***
UrbanYes 0.03785 0.04353 0.870 0.385
USYes 0.06868 0.04773 1.439 0.151
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4672 on 538 degrees of freedom
Multiple R-squared: 0.7543, Adjusted R-squared: 0.7492
F-statistic: 150.1 on 11 and 538 DF, p-value: < 2.2e-16
Here we define a new model with some interaction terms: a. Income and Advertising b. Income and CompPrice c. Price and Age
interaction.lm <- lm(Sales~. + Income*Advertising + Income*CompPrice + Price*Age, data=scaled.merged)
summary(interaction.lm)
Call:
lm(formula = Sales ~ . + Income * Advertising + Income * CompPrice +
Price * Age, data = scaled.merged)
Residuals:
Min 1Q Median 3Q Max
-1.31721 -0.30723 -0.01874 0.28583 1.91559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.69732 0.05993 -11.635 < 2e-16 ***
CompPrice 0.44981 0.02500 17.991 < 2e-16 ***
Income 0.10889 0.02028 5.369 1.18e-07 ***
Advertising 0.23790 0.02377 10.008 < 2e-16 ***
Population 0.01367 0.02121 0.645 0.5194
Price -0.70114 0.02513 -27.903 < 2e-16 ***
Age -0.23142 0.02003 -11.553 < 2e-16 ***
Education -0.03038 0.02024 -1.501 0.1339
ShelveLocGood 1.39095 0.06040 23.028 < 2e-16 ***
ShelveLocMedium 0.58444 0.04973 11.752 < 2e-16 ***
UrbanYes 0.03659 0.04323 0.846 0.3977
USYes 0.07076 0.04740 1.493 0.1360
Income:Advertising 0.03748 0.01999 1.875 0.0613 .
CompPrice:Income -0.05181 0.02053 -2.524 0.0119 *
Price:Age 0.01399 0.01991 0.703 0.4826
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4637 on 535 degrees of freedom
Multiple R-squared: 0.7593, Adjusted R-squared: 0.753
F-statistic: 120.5 on 14 and 535 DF, p-value: < 2.2e-16
Price Population CompPrice Income Education Age Advertising ShelveLoc
Plus the following interaction term: Income and Age Income and Advertising Income and CompPrice
Use the regsubsets command in “leaps” to do the full model ( do NOT specify really.big=T as it will require a long time to complete, along with forward and backward subset selection. Use the R file, Subset_select.R as a guide for syntax.
The result of the regressions is a model object, so you can query the proper coefficient of the model to determine which model, of the multiple models fit, produces the best metric.
Create a table, or simply list, for the full model, and forward and backward selection, which model # is the best, then write out the model, that is put the model in the form y=intercept + coefficient*variable, for the best model using the BIC measure. NOTE: these may be the same model, or may differ slightly.